<!DOCTYPE html>
<html lang="en">
<!-- Produced from a LaTeX source file.  Note that the production is done -->
<!-- by a very rough-and-ready (and buggy) script, so the HTML and other  -->
<!-- code is quite ugly!  Later versions should be better.                -->
<head>
    <meta charset="utf-8">
    <meta name="citation_title" content="Neural Networks and Deep Learning">
    <meta name="citation_author" content="Nielsen, Michael A.">
    <meta name="citation_publication_date" content="2015">
    <meta name="citation_fulltext_html_url" content="http://neuralnetworksanddeeplearning.com">
    <meta name="citation_publisher" content="Determination Press">
    <meta name="citation_fulltext_world_readable" content="">
    <link rel="icon" href="http://neuralnetworksanddeeplearning.com/nnadl_favicon.ICO" />
    <title>Neural networks and deep learning</title>
    <script src="assets/jquery.min.js"></script>
    <script type="text/x-mathjax-config">
      MathJax.Hub.Config({
        tex2jax: {inlineMath: [['$','$']]},
        "HTML-CSS": 
          {scale: 92},
        TeX: { equationNumbers: { autoNumber: "AMS" }}});
    </script>
    <script type="text/javascript" src="https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.1/MathJax.js?config=TeX-AMS-MML_HTMLorMML"></script>


    <link href="assets/style.css" rel="stylesheet">
    <link href="assets/pygments.css" rel="stylesheet">
    <link rel="stylesheet" href="https://code.jquery.com/ui/1.11.2/themes/smoothness/jquery-ui.css">

<style>
/* Adapted from */
/* https://groups.google.com/d/msg/mathjax-users/jqQxrmeG48o/oAaivLgLN90J, */
/* by David Cervone */

@font-face {
    font-family: 'MJX_Math';
    src: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.1/fonts/HTML-CSS/TeX/eot/MathJax_Math-Italic.eot'); /* IE9 Compat Modes */
    src: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.1/fonts/HTML-CSS/TeX/eot/MathJax_Math-Italic.eot?iefix') format('eot'),
    url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.1/fonts/HTML-CSS/TeX/woff/MathJax_Math-Italic.woff')  format('woff'),
    url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.1/fonts/HTML-CSS/TeX/otf/MathJax_Math-Italic.otf')  format('opentype'),
    url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.1/fonts/HTML-CSS/TeX/svg/MathJax_Math-Italic.svg#MathJax_Math-Italic') format('svg');
}

@font-face {
    font-family: 'MJX_Main';
    src: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.1/fonts/HTML-CSS/TeX/eot/MathJax_Main-Regular.eot'); /* IE9 Compat Modes */
    src: url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.1/fonts/HTML-CSS/TeX/eot/MathJax_Main-Regular.eot?iefix') format('eot'),
    url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.1/fonts/HTML-CSS/TeX/woff/MathJax_Main-Regular.woff')  format('woff'),
    url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.1/fonts/HTML-CSS/TeX/otf/MathJax_Main-Regular.otf')  format('opentype'),
    url('https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.1/fonts/HTML-CSS/TeX/svg/MathJax_Main-Regular.svg#MathJax_Main-Regular') format('svg');
}
</style>

  </head>
  <body><div class="header"><h1 class="chapter_number">
  <a href="chap4.html">CHAPTER 4</a></h1>
  <h1 class="chapter_title"><a href="chap4.html">A visual proof that neural nets can compute any function</a></h1></div><div class="section"><div id="toc"> 
<p class="toc_title"><a href="index.html">Neural Networks and Deep Learning</a></p><p class="toc_not_mainchapter"><a href="about.html">What this book is about</a></p><p class="toc_not_mainchapter"><a href="exercises_and_problems.html">On the exercises and problems</a></p><p class='toc_mainchapter'><a id="toc_using_neural_nets_to_recognize_handwritten_digits_reveal" class="toc_reveal" onMouseOver="this.style.borderBottom='1px solid #2A6EA6';" onMouseOut="this.style.borderBottom='0px';"><img id="toc_img_using_neural_nets_to_recognize_handwritten_digits" src="images/arrow.png" width="15px"></a><a href="chap1.html">Using neural nets to recognize handwritten digits</a><div id="toc_using_neural_nets_to_recognize_handwritten_digits" style="display: none;"><p class="toc_section"><ul><a href="chap1.html#perceptrons"><li>Perceptrons</li></a><a href="chap1.html#sigmoid_neurons"><li>Sigmoid neurons</li></a><a href="chap1.html#the_architecture_of_neural_networks"><li>The architecture of neural networks</li></a><a href="chap1.html#a_simple_network_to_classify_handwritten_digits"><li>A simple network to classify handwritten digits</li></a><a href="chap1.html#learning_with_gradient_descent"><li>Learning with gradient descent</li></a><a href="chap1.html#implementing_our_network_to_classify_digits"><li>Implementing our network to classify digits</li></a><a href="chap1.html#toward_deep_learning"><li>Toward deep learning</li></a></ul></p></div>
<script>
$('#toc_using_neural_nets_to_recognize_handwritten_digits_reveal').click(function() { 
   var src = $('#toc_img_using_neural_nets_to_recognize_handwritten_digits').attr('src');
   if(src == 'images/arrow.png') {
     $("#toc_img_using_neural_nets_to_recognize_handwritten_digits").attr('src', 'images/arrow_down.png');
   } else {
     $("#toc_img_using_neural_nets_to_recognize_handwritten_digits").attr('src', 'images/arrow.png');
   };
   $('#toc_using_neural_nets_to_recognize_handwritten_digits').toggle('fast', function() {});  
});</script><p class='toc_mainchapter'><a id="toc_how_the_backpropagation_algorithm_works_reveal" class="toc_reveal" onMouseOver="this.style.borderBottom='1px solid #2A6EA6';" onMouseOut="this.style.borderBottom='0px';"><img id="toc_img_how_the_backpropagation_algorithm_works" src="images/arrow.png" width="15px"></a><a href="chap2.html">How the backpropagation algorithm works</a><div id="toc_how_the_backpropagation_algorithm_works" style="display: none;"><p class="toc_section"><ul><a href="chap2.html#warm_up_a_fast_matrix-based_approach_to_computing_the_output_from_a_neural_network"><li>Warm up: a fast matrix-based approach to computing the output  from a neural network</li></a><a href="chap2.html#the_two_assumptions_we_need_about_the_cost_function"><li>The two assumptions we need about the cost function</li></a><a href="chap2.html#the_hadamard_product_$s_\odot_t$"><li>The Hadamard product, $s \odot t$</li></a><a href="chap2.html#the_four_fundamental_equations_behind_backpropagation"><li>The four fundamental equations behind backpropagation</li></a><a href="chap2.html#proof_of_the_four_fundamental_equations_(optional)"><li>Proof of the four fundamental equations (optional)</li></a><a href="chap2.html#the_backpropagation_algorithm"><li>The backpropagation algorithm</li></a><a href="chap2.html#the_code_for_backpropagation"><li>The code for backpropagation</li></a><a href="chap2.html#in_what_sense_is_backpropagation_a_fast_algorithm"><li>In what sense is backpropagation a fast algorithm?</li></a><a href="chap2.html#backpropagation_the_big_picture"><li>Backpropagation: the big picture</li></a></ul></p></div>
<script>
$('#toc_how_the_backpropagation_algorithm_works_reveal').click(function() { 
   var src = $('#toc_img_how_the_backpropagation_algorithm_works').attr('src');
   if(src == 'images/arrow.png') {
     $("#toc_img_how_the_backpropagation_algorithm_works").attr('src', 'images/arrow_down.png');
   } else {
     $("#toc_img_how_the_backpropagation_algorithm_works").attr('src', 'images/arrow.png');
   };
   $('#toc_how_the_backpropagation_algorithm_works').toggle('fast', function() {});  
});</script><p class='toc_mainchapter'><a id="toc_improving_the_way_neural_networks_learn_reveal" class="toc_reveal" onMouseOver="this.style.borderBottom='1px solid #2A6EA6';" onMouseOut="this.style.borderBottom='0px';"><img id="toc_img_improving_the_way_neural_networks_learn" src="images/arrow.png" width="15px"></a><a href="chap3.html">Improving the way neural networks learn</a><div id="toc_improving_the_way_neural_networks_learn" style="display: none;"><p class="toc_section"><ul><a href="chap3.html#the_cross-entropy_cost_function"><li>The cross-entropy cost function</li></a><a href="chap3.html#overfitting_and_regularization"><li>Overfitting and regularization</li></a><a href="chap3.html#weight_initialization"><li>Weight initialization</li></a><a href="chap3.html#handwriting_recognition_revisited_the_code"><li>Handwriting recognition revisited: the code</li></a><a href="chap3.html#how_to_choose_a_neural_network's_hyper-parameters"><li>How to choose a neural network's hyper-parameters?</li></a><a href="chap3.html#other_techniques"><li>Other techniques</li></a></ul></p></div>
<script>
$('#toc_improving_the_way_neural_networks_learn_reveal').click(function() { 
   var src = $('#toc_img_improving_the_way_neural_networks_learn').attr('src');
   if(src == 'images/arrow.png') {
     $("#toc_img_improving_the_way_neural_networks_learn").attr('src', 'images/arrow_down.png');
   } else {
     $("#toc_img_improving_the_way_neural_networks_learn").attr('src', 'images/arrow.png');
   };
   $('#toc_improving_the_way_neural_networks_learn').toggle('fast', function() {});  
});</script><p class='toc_mainchapter'><a id="toc_a_visual_proof_that_neural_nets_can_compute_any_function_reveal" class="toc_reveal" onMouseOver="this.style.borderBottom='1px solid #2A6EA6';" onMouseOut="this.style.borderBottom='0px';"><img id="toc_img_a_visual_proof_that_neural_nets_can_compute_any_function" src="images/arrow.png" width="15px"></a><a href="chap4.html">A visual proof that neural nets can compute any function</a><div id="toc_a_visual_proof_that_neural_nets_can_compute_any_function" style="display: none;"><p class="toc_section"><ul><a href="chap4.html#two_caveats"><li>Two caveats</li></a><a href="chap4.html#universality_with_one_input_and_one_output"><li>Universality with one input and one output</li></a><a href="chap4.html#many_input_variables"><li>Many input variables</li></a><a href="chap4.html#extension_beyond_sigmoid_neurons"><li>Extension beyond sigmoid neurons</li></a><a href="chap4.html#fixing_up_the_step_functions"><li>Fixing up the step functions</li></a><a href="chap4.html#conclusion"><li>Conclusion</li></a></ul></p></div>
<script>
$('#toc_a_visual_proof_that_neural_nets_can_compute_any_function_reveal').click(function() { 
   var src = $('#toc_img_a_visual_proof_that_neural_nets_can_compute_any_function').attr('src');
   if(src == 'images/arrow.png') {
     $("#toc_img_a_visual_proof_that_neural_nets_can_compute_any_function").attr('src', 'images/arrow_down.png');
   } else {
     $("#toc_img_a_visual_proof_that_neural_nets_can_compute_any_function").attr('src', 'images/arrow.png');
   };
   $('#toc_a_visual_proof_that_neural_nets_can_compute_any_function').toggle('fast', function() {});  
});</script><p class='toc_mainchapter'><a id="toc_why_are_deep_neural_networks_hard_to_train_reveal" class="toc_reveal" onMouseOver="this.style.borderBottom='1px solid #2A6EA6';" onMouseOut="this.style.borderBottom='0px';"><img id="toc_img_why_are_deep_neural_networks_hard_to_train" src="images/arrow.png" width="15px"></a><a href="chap5.html">Why are deep neural networks hard to train?</a><div id="toc_why_are_deep_neural_networks_hard_to_train" style="display: none;"><p class="toc_section"><ul><a href="chap5.html#the_vanishing_gradient_problem"><li>The vanishing gradient problem</li></a><a href="chap5.html#what's_causing_the_vanishing_gradient_problem_unstable_gradients_in_deep_neural_nets"><li>What's causing the vanishing gradient problem?  Unstable gradients in deep neural nets</li></a><a href="chap5.html#unstable_gradients_in_more_complex_networks"><li>Unstable gradients in more complex networks</li></a><a href="chap5.html#other_obstacles_to_deep_learning"><li>Other obstacles to deep learning</li></a></ul></p></div>
<script>
$('#toc_why_are_deep_neural_networks_hard_to_train_reveal').click(function() { 
   var src = $('#toc_img_why_are_deep_neural_networks_hard_to_train').attr('src');
   if(src == 'images/arrow.png') {
     $("#toc_img_why_are_deep_neural_networks_hard_to_train").attr('src', 'images/arrow_down.png');
   } else {
     $("#toc_img_why_are_deep_neural_networks_hard_to_train").attr('src', 'images/arrow.png');
   };
   $('#toc_why_are_deep_neural_networks_hard_to_train').toggle('fast', function() {});  
});</script><p class='toc_mainchapter'><a id="toc_deep_learning_reveal" class="toc_reveal" onMouseOver="this.style.borderBottom='1px solid #2A6EA6';" onMouseOut="this.style.borderBottom='0px';"><img id="toc_img_deep_learning" src="images/arrow.png" width="15px"></a><a href="chap6.html">Deep learning</a><div id="toc_deep_learning" style="display: none;"><p class="toc_section"><ul><a href="chap6.html#introducing_convolutional_networks"><li>Introducing convolutional networks</li></a><a href="chap6.html#convolutional_neural_networks_in_practice"><li>Convolutional neural networks in practice</li></a><a href="chap6.html#the_code_for_our_convolutional_networks"><li>The code for our convolutional networks</li></a><a href="chap6.html#recent_progress_in_image_recognition"><li>Recent progress in image recognition</li></a><a href="chap6.html#other_approaches_to_deep_neural_nets"><li>Other approaches to deep neural nets</li></a><a href="chap6.html#on_the_future_of_neural_networks"><li>On the future of neural networks</li></a></ul></p></div>
<script>
$('#toc_deep_learning_reveal').click(function() { 
   var src = $('#toc_img_deep_learning').attr('src');
   if(src == 'images/arrow.png') {
     $("#toc_img_deep_learning").attr('src', 'images/arrow_down.png');
   } else {
     $("#toc_img_deep_learning").attr('src', 'images/arrow.png');
   };
   $('#toc_deep_learning').toggle('fast', function() {});  
});</script><p class="toc_not_mainchapter"><a href="sai.html">Appendix: Is there a <em>simple</em> algorithm for intelligence?</a></p><p class="toc_not_mainchapter"><a href="acknowledgements.html">Acknowledgements</a></p><p class="toc_not_mainchapter"><a href="faq.html">Frequently Asked Questions</a></p>
<hr>
<p class="sidebar"> If you benefit from the book, please make a small
donation.  I suggest $5, but you can choose the amount.</p>

<form action="https://www.paypal.com/cgi-bin/webscr" method="post" target="_top">
<input type="hidden" name="cmd" value="_s-xclick">
<input type="hidden" name="hosted_button_id" value="5K9YAHR4X84RN">
<input type="image" src="https://www.paypalobjects.com/en_US/i/btn/btn_donateCC_LG.gif" border="0" name="submit" alt="PayPal - The safer, easier way to pay online!">
<img alt="" border="0" src="https://www.paypalobjects.com/en_US/i/scr/pixel.gif" width="1" height="1">
</form>

<p class="sidebar">Alternately, you can make a donation by sending me
Bitcoin, at address <span style="font-size: 0.7em">1Kd6tXH5SDAmiFb49J9hknG5pqj7KStSAx</span></p>

<!--
<hr>

<p class="sidebar"> If you benefit from the book, please make a small
donation.  I suggest $3, but you can choose the amount.</p>

<form action="https://www.paypal.com/cgi-bin/webscr" method="post" target="_top">
<input type="hidden" name="cmd" value="_s-xclick">
<input type="hidden" name="encrypted" value="-----BEGIN PKCS7-----MIIHTwYJKoZIhvcNAQcEoIIHQDCCBzwCAQExggEwMIIBLAIBADCBlDCBjjELMAkGA1UEBhMCVVMxCzAJBgNVBAgTAkNBMRYwFAYDVQQHEw1Nb3VudGFpbiBWaWV3MRQwEgYDVQQKEwtQYXlQYWwgSW5jLjETMBEGA1UECxQKbGl2ZV9jZXJ0czERMA8GA1UEAxQIbGl2ZV9hcGkxHDAaBgkqhkiG9w0BCQEWDXJlQHBheXBhbC5jb20CAQAwDQYJKoZIhvcNAQEBBQAEgYAtusFIFTgWVpgZsMgI9zMrWRAFFKQqeFiE6ay1nbmP360YzPtR+vvCXwn214Az9+F9g7mFxe0L+m9zOCdjzgRROZdTu1oIuS78i0TTbcbD/Vs/U/f9xcmwsdX9KYlhimfsya0ydPQ2xvr4iSGbwfNemIPVRCTadp/Y4OQWWRFKGTELMAkGBSsOAwIaBQAwgcwGCSqGSIb3DQEHATAUBggqhkiG9w0DBwQIK5obVTaqzmyAgajgc4w5t7l6DjTGVI7k+4UyO3uafxPac23jOyBGmxSnVRPONB9I+/Q6OqpXZtn8JpTuzFmuIgkNUf1nldv/DA1mhPOeeVxeuSGL8KpWxpJboKZ0mEu9b+0FJXvZW+snv0jodnRDtI4g0AXDZNPyRWIdJ3m+tlYfsXu4mQAe0q+CyT+QrSRhPGI/llicF4x3rMbRBNqlDze/tFqp/jbgW84Puzz6KyxAez6gggOHMIIDgzCCAuygAwIBAgIBADANBgkqhkiG9w0BAQUFADCBjjELMAkGA1UEBhMCVVMxCzAJBgNVBAgTAkNBMRYwFAYDVQQHEw1Nb3VudGFpbiBWaWV3MRQwEgYDVQQKEwtQYXlQYWwgSW5jLjETMBEGA1UECxQKbGl2ZV9jZXJ0czERMA8GA1UEAxQIbGl2ZV9hcGkxHDAaBgkqhkiG9w0BCQEWDXJlQHBheXBhbC5jb20wHhcNMDQwMjEzMTAxMzE1WhcNMzUwMjEzMTAxMzE1WjCBjjELMAkGA1UEBhMCVVMxCzAJBgNVBAgTAkNBMRYwFAYDVQQHEw1Nb3VudGFpbiBWaWV3MRQwEgYDVQQKEwtQYXlQYWwgSW5jLjETMBEGA1UECxQKbGl2ZV9jZXJ0czERMA8GA1UEAxQIbGl2ZV9hcGkxHDAaBgkqhkiG9w0BCQEWDXJlQHBheXBhbC5jb20wgZ8wDQYJKoZIhvcNAQEBBQADgY0AMIGJAoGBAMFHTt38RMxLXJyO2SmS+Ndl72T7oKJ4u4uw+6awntALWh03PewmIJuzbALScsTS4sZoS1fKciBGoh11gIfHzylvkdNe/hJl66/RGqrj5rFb08sAABNTzDTiqqNpJeBsYs/c2aiGozptX2RlnBktH+SUNpAajW724Nv2Wvhif6sFAgMBAAGjge4wgeswHQYDVR0OBBYEFJaffLvGbxe9WT9S1wob7BDWZJRrMIG7BgNVHSMEgbMwgbCAFJaffLvGbxe9WT9S1wob7BDWZJRroYGUpIGRMIGOMQswCQYDVQQGEwJVUzELMAkGA1UECBMCQ0ExFjAUBgNVBAcTDU1vdW50YWluIFZpZXcxFDASBgNVBAoTC1BheVBhbCBJbmMuMRMwEQYDVQQLFApsaXZlX2NlcnRzMREwDwYDVQQDFAhsaXZlX2FwaTEcMBoGCSqGSIb3DQEJARYNcmVAcGF5cGFsLmNvbYIBADAMBgNVHRMEBTADAQH/MA0GCSqGSIb3DQEBBQUAA4GBAIFfOlaagFrl71+jq6OKidbWFSE+Q4FqROvdgIONth+8kSK//Y/4ihuE4Ymvzn5ceE3S/iBSQQMjyvb+s2TWbQYDwcp129OPIbD9epdr4tJOUNiSojw7BHwYRiPh58S1xGlFgHFXwrEBb3dgNbMUa+u4qectsMAXpVHnD9wIyfmHMYIBmjCCAZYCAQEwgZQwgY4xCzAJBgNVBAYTAlVTMQswCQYDVQQIEwJDQTEWMBQGA1UEBxMNTW91bnRhaW4gVmlldzEUMBIGA1UEChMLUGF5UGFsIEluYy4xEzARBgNVBAsUCmxpdmVfY2VydHMxETAPBgNVBAMUCGxpdmVfYXBpMRwwGgYJKoZIhvcNAQkBFg1yZUBwYXlwYWwuY29tAgEAMAkGBSsOAwIaBQCgXTAYBgkqhkiG9w0BCQMxCwYJKoZIhvcNAQcBMBwGCSqGSIb3DQEJBTEPFw0xNTA4MDUxMzMyMTRaMCMGCSqGSIb3DQEJBDEWBBRtGLYvbZ45sWVegWVP2CuXTHPmJTANBgkqhkiG9w0BAQEFAASBgKgrMHMINfV7yVuZgcTjp8gUzejPF2x2zRPU/G8pKUvYIl1F38TjV2pe4w0QXcGMJRT8mQfxHCy9UmF3LfblH8F0NSMMDrZqu3M0eLk96old+L0Xl6ING8l3idFDkLagE+lZK4A0rNV35aMci3VLvjQ34CvEj7jaHeLpbkgk/l6v-----END PKCS7-----
">
<input type="image" src="https://www.paypalobjects.com/en_US/i/btn/btn_donateCC_LG.gif" border="0" name="submit" alt="PayPal - The safer, easier way to pay online!">
<img alt="" border="0" src="https://www.paypalobjects.com/en_US/i/scr/pixel.gif" width="1" height="1">
</form>

-->

<hr>
<span class="sidebar_title">Sponsors</span>
<br/>

<a href="https://lambdalabs.com/?utm_source=neuralnetworksdeeplearning&utm_medium=banner&utm_campaign=blogin&utm_content=rbannerimg">
  <img src="assets/lambda.png" width="200px" style="padding: 3px 0px 0px 10px; border-style: none;">
</a>
<br>
<div style="line-height: 1.2; padding-bottom: 12px; font-size: 0.8;">
  <a href="https://lambdalabs.com/?utm_source=neuralnetworksdeeplearning&utm_medium=banner&utm_campaign=blogin&utm_content=rtext">Deep Learning Workstations, Servers, and Laptops</a>
</div>

<a href='http://gsquaredcapital.com/'><img src='assets/gsquared.png' width='200px' style="padding: 5px 0px 10px 10px; border-style: none;"></a>

<a href='http://www.tineye.com'><img src='assets/tineye.png' width='200px'
style="padding: 0px 0px 10px 8px; border-style: none;"></a>

<a href='http://www.visionsmarts.com'><img
src='assets/visionsmarts.png' width='210px' style="padding: 0px 0px
0px 0px; border-style: none;"></a> <br/> 

<p class="sidebar">Thanks to all the <a
href="supporters.html">supporters</a> who made the book possible, with
especial thanks to Pavel Dudrenov.  Thanks also to all the
contributors to the <a href="bugfinder.html">Bugfinder Hall of
Fame</a>.  </p>

<hr>
<span class="sidebar_title">Resources</span>

<p class="sidebar"><a href="https://twitter.com/michael_nielsen">Michael Nielsen on Twitter</a></p>

<p class="sidebar"><a href="faq.html">Book FAQ</a></p>

<p class="sidebar">
<a href="https://github.com/mnielsen/neural-networks-and-deep-learning">Code repository</a></p>

<p class="sidebar">
<a href="https://michaelnielsenupdates.substack.com/subscribe">Michael Nielsen's project announcement mailing list</a>
</p>

<p class="sidebar"> <a href="http://www.deeplearningbook.org/">Deep Learning</a>, book by Ian
Goodfellow, Yoshua Bengio, and Aaron Courville</p>

<p class="sidebar"><a href="http://cognitivemedium.com">cognitivemedium.com</a></p>

<hr>
<a href="http://michaelnielsen.org"><img src="assets/Michael_Nielsen_Web_Small.jpg" width="160px" style="border-style: none;"/></a>

<p class="sidebar">
By <a href="http://michaelnielsen.org">Michael Nielsen</a> / Dec 2019
</p>
</div>
</p><p>One of the most striking facts about neural networks is that they cancompute any function at all.  That is, suppose someone hands you somecomplicated, wiggly function, $f(x)$:</p><p><center><canvas id="function" width="300" height="300"></canvas></center></p><p><a id="basic_network_precursor"></a> No matter what thefunction, there is guaranteed to be a neural network so that for everypossible input, $x$, the value $f(x)$ (or some close approximation) isoutput from the network, e.g.:</p><p><center><canvas id="basic_network" width="350" height="220"></canvas></center></p><p>This result holds even if the function has many inputs, $f = f(x_1,\ldots, x_m)$, and many outputs.  For instance, here's a networkcomputing a function with $m = 3$ inputs and $n = 2$ outputs:</p><p><center><canvas id="vector_valued_network" width="450" height="370"></canvas></center></p><p>This result tells us that neural networks have a kind of<em>universality</em>.  No matter what function we want to compute, weknow that there is a neural network which can do the job.</p><p>What's more, this universality theorem holds even if we restrict ournetworks to have just a single layer intermediate between the inputand the output neurons - a so-called single hidden layer.  So evenvery simple network architectures can be extremely powerful.</p><p>The universality theorem is well known by people who use neuralnetworks.  But why it's true is not so widely understood.  Most of theexplanations available are quite technical.  For instance, one of theoriginal papers proving theresult*<span class="marginnote">*<a href="http://www.dartmouth.edu/&#126;gvc/Cybenko_MCSS.pdf">Approximation    by superpositions of a sigmoidal function</a>, by George Cybenko  (1989).  The result was very much in the air at the time, and  several groups proved closely related results.  Cybenko's paper  contains a useful discussion of much of that work.  Another  important early paper is  <a href="http://www.sciencedirect.com/science/article/pii/0893608089900208">Multilayer    feedforward networks are universal approximators</a>, by Kurt Hornik,  Maxwell Stinchcombe, and Halbert White (1989).  This paper uses the  Stone-Weierstrass theorem to arrive at similar results.</span>  did sousing the Hahn-Banach theorem, the Riesz Representation theorem, andsome Fourier analysis.  If you're a mathematician the argument is notdifficult to follow, but it's not so easy for most people.  That's apity, since the underlying reasons for universality are simple andbeautiful.</p><p>In this chapter I give a simple and mostly visual explanation of theuniversality theorem.  We'll go step by step through the underlyingideas.  You'll understand why it's true that neural networks cancompute any function.  You'll understand some of the limitations ofthe result.  And you'll understand how the result relates to deepneural networks.</p><p>To follow the material in the chapter, you do not need to have readearlier chapters in this book.  Instead, the chapter is structured tobe enjoyable as a self-contained essay.  Provided you have just alittle basic familiarity with neural networks, you should be able tofollow the explanation.  I will, however, provide occasional links toearlier material, to help fill in any gaps in your knowledge.</p><p></p><p></p><p></p><p>Universality theorems are a commonplace in computer science, so muchso that we sometimes forget how astonishing they are.  But it's worthreminding ourselves: the ability to compute an arbitrary function istruly remarkable.  Almost any process you can imagine can be thoughtof as function computation.  Consider the problem of naming a piece ofmusic based on a short sample of the piece.  That can be thought of ascomputing a function.  Or consider the problem of translating aChinese text into English.  Again, that can be thought of as computinga function*<span class="marginnote">*Actually, computing one of many functions, since  there are often many acceptable translations of a given piece of  text.</span>.  Or consider the problem of taking an mp4 movie file andgenerating a description of the plot of the movie, and a discussion ofthe quality of the acting.  Again, that can be thought of as a kind offunction computation*<span class="marginnote">*Ditto the remark about translation and  there being many possible functions.</span>.  Universality means that, inprinciple, neural networks can do all these things and many more.</p><p>Of course, just because we know a neural network exists that can (say)translate Chinese text into English, that doesn't mean we have goodtechniques for constructing or even recognizing such a network.  Thislimitation applies also to traditional universality theorems formodels such as Boolean circuits.  But, as we've seen earlier in thebook, neural networks have powerful algorithms for learning functions.That combination of learning algorithms + universality is anattractive mix.  Up to now, the book has focused on the learningalgorithms.  In this chapter, we focus on universality, and what itmeans.</p><p><h3><a name="two_caveats"></a><a href="chap4.html#two_caveats">Two caveats</a></h3></p><p>Before explaining why the universality theorem is true, I want tomention two caveats to the informal statement "a neural network cancompute any function".</p><p>First, this doesn't mean that a network can be used to <em>exactly</em>compute any function. Rather, we can get an <em>approximation</em>that is as good as we want.  By increasing the number of hiddenneurons we can improve the approximation.  For instance,<a href="chap4.html#basic_network_precursor">earlier</a> I illustrated a networkcomputing some function $f(x)$ using three hidden neurons.  For mostfunctions only a low-quality approximation will be possible usingthree hidden neurons.  By increasing the number of hidden neurons(say, to five) we can typically get a better approximation:</p><p><center><canvas id="bigger_network" width="350" height="380"></canvas></center></p><p>And we can do still better by further increasing the number of hiddenneurons.  </p><p>To make this statement more precise, suppose we're given a function$f(x)$ which we'd like to compute to within some desired accuracy$\epsilon > 0$.  The guarantee is that by using enough hidden neuronswe can always find a neural network whose output $g(x)$ satisfies$|g(x) - f(x)| < \epsilon$, for all inputs $x$.  In other words, theapproximation will be good to within the desired accuracy for everypossible input.</p><p>The second caveat is that the class of functions which can beapproximated in the way described are the <em>continuous</em> functions.If a function is discontinuous, i.e., makes sudden, sharp jumps, thenit won't in general be possible to approximate using a neural net.This is not surprising, since our neural networks compute continuousfunctions of their input.  However, even if the function we'd reallylike to compute is discontinuous, it's often the case that acontinuous approximation is good enough.  If that's so, then we canuse a neural network.  In practice, this is not usually an importantlimitation.</p><p>Summing up, a more precise statement of the universality theorem isthat neural networks with a single hidden layer can be used toapproximate any continuous function to any desired precision.  In thischapter we'll actually prove a slightly weaker version of this result,using two hidden layers instead of one.  In the problems I'll brieflyoutline how the explanation can, with a few tweaks, be adapted to givea proof which uses only a single hidden layer. <h3><a name="universality_with_one_input_and_one_output"></a><a href="chap4.html#universality_with_one_input_and_one_output">Universality with one input and one output</a></h3></p><p>To understand why the universality theorem is true, let's start byunderstanding how to construct a neural network which approximates afunction with just one input and one output:</p><p><center><canvas id="function_2" width="300" height="300"></canvas></center></p><p>It turns out that this is the core of the problem of universality.Once we've understood this special case it's actually pretty easy toextend to functions with many inputs and many outputs.</p><p>To build insight into how to construct a network to compute $f$, let'sstart with a network containing just a single hidden layer, with twohidden neurons, and an output layer containing a single output neuron:</p><p><center><canvas id="two_hidden_neurons" width="350" height="220"></canvas></center></p><p>To get a feel for how components in the network work, let's focus onthe top hidden neuron.  In the diagram below, click on the weight,$w$, and drag the mouse a little ways to the right to increase $w$.You can immediately see how the function computed by the top hiddenneuron changes:</p><p><center><canvas id="basic_manipulation" width="600" height="285"></canvas></center></p><p>As we learnt <a href="chap1.html#sigmoid_neurons">earlier in the book</a>,what's being computed by the hidden neuron is $\sigma(wx + b)$, where$\sigma(z) \equiv 1/(1+e^{-z})$ is the sigmoid function.  Up to now,we've made frequent use of this algebraic form.  But for the proof ofuniversality we will obtain more insight by ignoring the algebraentirely, and instead manipulating and observing the shape shown inthe graph.  This won't just give us a better feel for what's going on,it will also give us a proof*<span class="marginnote">*Strictly speaking, the visual  approach I'm taking isn't what's traditionally thought of as a  proof.  But I believe the visual approach gives more insight into  why the result is true than a traditional proof.  And, of course,  that kind of insight is the real purpose behind a proof.  Occasionally, there will be small gaps in the reasoning I present:  places where I make a visual argument that is plausible, but not  quite rigorous.  If this bothers you, then consider it a challenge  to fill in the missing steps.  But don't lose sight of the real  purpose: to understand why the universality theorem is true.</span> ofuniversality that applies to activation functions other than thesigmoid function.</p><p>To get started on this proof, try clicking on the bias, $b$, in thediagram above, and dragging to the right to increase it.  You'll seethat as the bias increases the graph moves to the left, but its shapedoesn't change.</p><p>Next, click and drag to the left in order to decrease the bias.You'll see that as the bias decreases the graph moves to the right,but, again, its shape doesn't change.</p><p>Next, decrease the weight to around $2$ or $3$.  You'll see that asyou decrease the weight, the curve broadens out.  You might need tochange the bias as well, in order to keep the curve in-frame.</p><p>Finally, increase the weight up past $w = 100$.  As you do, the curvegets steeper, until eventually it begins to look like a step function.Try to adjust the bias so the step occurs near $x = 0.3$.  Thefollowing short clip shows what your result should look like.  Clickon the play button to play (or replay) the video:</p><p><!-- Based on http://worrydream.com/ScrubbingCalculator/, with minor changes -->      <script type="text/javascript">    	function playVideo (name) {    		var div = $("#"+name)[0];     		div.style.backgroundColor = "transparent";    		div.style.cursor = "default";    		div.getElementsByTagName("img")[0].style.display = "none";    		var video = $("#v" + name)[0];    		video.play();    	}    	function videoEnded (name) {    		var div = document.getElementById(name);    		div.getElementsByTagName("img")[0].style.display = "block";	        div.style.backgroundColor = "white";	        div.style.opacity = 0.6;    		div.style.cursor = "pointer";    	}      </script>      <div>	<div id="a" class="videoOverlay" 	     style="width: 560px; height: 280px; opacity: 0.8" 	     onclick="playVideo('a');">	  <img style="left: 210px; top: 75px;" 	       src="images/play.png" width="128px">	</div>	  <video id="va" width="560" height="280" preload 		 onended="videoEnded('a');">	    <source type="video/webm"		    src="movies/create_step_function.webm">	    <source type="video/mp4"		    src="movies/create_step_function.mp4">	  </video></div></p><p>We can simplify our analysis quite a bit by increasing the weight somuch that the output really is a step function, to a very goodapproximation.  Below I've plotted the output from the top hiddenneuron when the weight is $w = 999$.  Note that this plot is static,and you can't change parameters such as the weight.</p><p><img src="images/high_weight_function.jpg"></p><p>It's actually quite a bit easier to work with step functions thangeneral sigmoid functions.  The reason is that in the output layer weadd up contributions from all the hidden neurons.  It's easy toanalyze the sum of a bunch of step functions, but rather moredifficult to reason about what happens when you add up a bunch ofsigmoid shaped curves.  And so it makes things much easier to assumethat our hidden neurons are outputting step functions.  Moreconcretely, we do this by fixing the weight $w$ to be some very largevalue, and then setting the position of the step by modifying thebias.  Of course, treating the output as a step function is anapproximation, but it's a very good approximation, and for now we'lltreat it as exact.  I'll come back later to discuss the impact ofdeviations from this approximation.</p><p>At what value of $x$ does the step occur?  Put another way, how doesthe position of the step depend upon the weight and bias?</p><p>To answer this question, try modifying the weight and bias in thediagram above (you may need to scroll back a bit).  Can you figure outhow the position of the step depends on $w$ and $b$?  With a littlework you should be able to convince yourself that the position of thestep is <em>proportional</em> to $b$, and <em>inversely proportional</em>to $w$.</p><p>In fact, the step is at position $s = -b/w$, as you can see bymodifying the weight and bias in the following diagram:</p><p><canvas id="step" width="600" height="285"></canvas></p><p>It will greatly simplify our lives to describe hidden neurons usingjust a single parameter, $s$, which is the step position, $s = -b/w$.Try modifying $s$ in the following diagram, in order to get used tothe new parameterization:</p><p><canvas id="step_parameterization" width="600" height="285"></canvas></p><p>As noted above, we've implicitly set the weight $w$ on the input to besome large value - big enough that the step function is a very goodapproximation.  We can easily convert a neuron parameterized in thisway back into the conventional model, by choosing the bias $b = -w s$.</p><p>Up to now we've been focusing on the output from just the top hiddenneuron.  Let's take a look at the behavior of the entire network.  Inparticular, we'll suppose the hidden neurons are computing stepfunctions parameterized by step points $s_1$ (top neuron) and $s_2$(bottom neuron).  And they'll have respective output weights $w_1$ and$w_2$.  Here's the network:</p><p><canvas id="two_hn_network" width="600" height="285"></canvas></p><p>What's being plotted on the right is the <em>weighted output</em> $w_1a_1 + w_2 a_2$ from the hidden layer.  Here, $a_1$ and $a_2$ are theoutputs from the top and bottom hidden neurons,respectively*<span class="marginnote">*Note, by the way, that the output from the whole  network is $\sigma(w_1 a_1+w_2 a_2 + b)$, where $b$ is the bias on  the output neuron.  Obviously, this isn't the same as the weighted  output from the hidden layer, which is what we're plotting here.  We're going to focus on the weighted output from the hidden layer  right now, and only later will we think about how that relates to  the output from the whole network.</span>.  These outputs are denoted with$a$s because they're often known as the neurons' <em>activations</em>.</p><p>Try increasing and decreasing the step point $s_1$ of the top hiddenneuron.  Get a feel for how this changes the weighted output from thehidden layer.	It's particularly worth understanding what happens when $s_1$ goespast $s_2$.  You'll see that the graph changes shape when thishappens, since we have moved from a situation where the top hiddenneuron is the first to be activated to a situation where the bottomhidden neuron is the first to be activated.</p><p>Similarly, try manipulating the step point $s_2$ of the bottom hiddenneuron, and get a feel for how this changes the combined output fromthe hidden neurons.</p><p>Try increasing and decreasing each of the output weights.  Notice howthis rescales the contribution from the respective hidden neurons.What happens when one of the weights is zero?</p><p>Finally, try setting $w_1$ to be $0.8$ and $w_2$ to be $-0.8$.  Youget a "bump" function, which starts at point $s_1$, ends at point$s_2$, and has height $0.8$.  For instance, the weighted output mightlook like this:</p><p><img src="images/bump_function.jpg"></p><p>Of course, we can rescale the bump to have any height at all.  Let'suse a single parameter, $h$, to denote the height.  To reduce clutterI'll also remove the "$s_1 = \ldots$" and "$w_1 = \ldots$" notations.</p><p><canvas id="bump_fn" width="600" height="285"></canvas></p><p>Try changing the value of $h$ up and down, to see how the height ofthe bump changes.  Try changing the height so it's negative, andobserve what happens.  And try changing the step points to see howthat changes the shape of the bump.</p><p>You'll notice, by the way, that we're using our neurons in a way thatcan be thought of not just in graphical terms, but in moreconventional programming terms, as a kind of <tt>if-then-else</tt>statement, e.g.:</p><p><div class="highlight"><pre><span></span>    <span class="k">if</span> input &gt;<span class="o">=</span> step point:
        add <span class="m">1</span> to the weighted output
    <span class="k">else</span>:
        add <span class="m">0</span> to the weighted output
</pre></div>
</p><p>For the most part I'm going to stick with the graphical point of view.But in what follows you may sometimes find it helpful to switch pointsof view, and think about things in terms of <tt>if-then-else</tt>.</p><p>We can use our bump-making trick to get two bumps, by gluing two pairsof hidden neurons together into the same network:</p><p><canvas id="double_bump" width="600" height = "280"></canvas></p><p>I've suppressed the weights here, simply writing the $h$ values foreach pair of hidden neurons.  Try increasing and decreasing both $h$values, and observe how it changes the graph.  Move the bumps aroundby changing the step points.</p><p>More generally, we can use this idea to get as many peaks as we want,of any height.  In particular, we can divide the interval $[0, 1]$ upinto a large number, $N$, of subintervals, and use $N$ pairs of hiddenneurons to set up peaks of any desired height.  Let's see how thisworks for $N = 5$.  That's quite a few neurons, so I'm going to packthings in a bit.  Apologies for the complexity of the diagram: I couldhide the complexity by abstracting away further, but I think it'sworth putting up with a little complexity, for the sake of getting amore concrete feel for how these networks work.</p><p><canvas id="five_bumps" width="600" height = "620"></canvas></p><p>You can see that there are five pairs of hidden neurons.  The steppoints for the respective pairs of neurons are $0, 1/5$, then $1/5,2/5$, and so on, out to $4/5, 5/5$.  These values are fixed - theymake it so we get five evenly spaced bumps on the graph.</p><p>Each pair of neurons has a value of $h$ associated to it.  Remember,the connections output from the neurons have weights $h$ and $-h$ (notmarked).  Click on one of the $h$ values, and drag the mouse to theright or left to change the value.  As you do so, watch the functionchange.  By changing the output weights we're actually<em>designing</em> the function!</p><p>Contrariwise, try clicking on the graph, and dragging up or down tochange the height of any of the bump functions.  As you change theheights, you can see the corresponding change in $h$ values.  And,although it's not shown, there is also a change in the correspondingoutput weights, which are $+h$ and $-h$.</p><p>In other words, we can directly manipulate the function appearing inthe graph on the right, and see that reflected in the $h$ values onthe left.  A fun thing to do is to hold the mouse button down and dragthe mouse from one side of the graph to the other.  As you do this youdraw out a function, and get to watch the parameters in the neuralnetwork adapt.</p><p>Time for a challenge.</p><p>Let's think back to the function I plotted at the beginning of thechapter:</p><p><center> <canvas id="function_3" width="300" height="300"></canvas></center></p><p>I didn't say it at the time, but what I plotted is actually thefunction<a class="displaced_anchor" name="eqtn113"></a>\begin{eqnarray}f(x) = 0.2+0.4 x^2+0.3x \sin(15 x) + 0.05 \cos(50 x),\tag{113}\end{eqnarray}plotted over $x$ from $0$ to $1$, and with the $y$ axis takingvalues from $0$ to $1$.</p><p>That's obviously not a trivial function.</p><p>You're going to figure out how to compute it using a neural network.</p><p>In our networks above we've been analyzing the weighted combination$\sum_j w_j a_j$ output from the hidden neurons.  We now know how toget a lot of control over this quantity.  But, as I noted earlier,this quantity is not what's output from the network.  What's outputfrom the network is $\sigma(\sum_j w_j a_j + b)$ where $b$ is the biason the output neuron.  Is there some way we can achieve control overthe actual output from the network?</p><p>The solution is to design a neural network whose hidden layer has aweighted output given by $\sigma^{-1} \circ f(x)$, where $\sigma^{-1}$is just the inverse of the $\sigma$ function.  That is, we want theweighted output from the hidden layer to be:</p><p><center> <canvas id="inverted_function" width="340"height="300"></canvas> </center></p><p>If we can do this, then the output from the network as a whole will bea good approximation to $f(x)$*<span class="marginnote">*Note that I have set the bias  on the output neuron to $0$.</span>.</p><p>Your challenge, then, is to design a neural network to approximate thegoal function shown just above.  To learn as much as possible, I wantyou to solve the problem twice.  The first time, please click on thegraph, directly adjusting the heights of the different bump functions.You should find it fairly easy to get a good match to the goalfunction.  How well you're doing is measured by the <em>average  deviation</em> between the goal function and the function the network isactually computing.  Your challenge is to drive the average deviationas <em>low</em> as possible.  You complete the challenge when you drivethe average deviation to $0.40$ or below.</p><p>Once you've done that, click on "Reset" to randomly re-initializethe bumps.  The second time you solve the problem, resist the urge toclick on the graph.  Instead, modify the $h$ values on the left-handside, and again attempt to drive the average deviation to $0.40$ orbelow.</p><p><canvas id="design_function" width="600" height = "620"></canvas></p><p>You've now figured out all the elements necessary for the network toapproximately compute the function $f(x)$!  It's only a coarseapproximation, but we could easily do much better, merely byincreasing the number of pairs of hidden neurons, allowing more bumps.</p><p>In particular, it's easy to convert all the data we have found backinto the standard parameterization used for neural networks.  Let mejust recap quickly how that works.</p><p>The first layer of weights all have some large, constant value, say $w= 1000$.</p><p>The biases on the hidden neurons are just $b = -w s$.  So, forinstance, for the second hidden neuron $s = 0.2$ becomes $b = -1000\times 0.2 = -200$.</p><p>The final layer of weights are determined by the $h$ values.  So, forinstance, the value you've chosen above for the first $h$, $h = $<span id="h" style="font-family: MJX_Main;"></span>, means thatthe output weights from the top two hidden neurons are <span  id="w1" style="font-family: MJX_Main;"></span> and <span  id="w2" style="font-family: MJX_Main;"></span>, respectively.  Andso on, for the entire layer of output weights.</p><p>Finally, the bias on the output neuron is $0$.</p><p>That's everything: we now have a complete description of a neuralnetwork which does a pretty good job computing our original goalfunction.  And we understand how to improve the quality of theapproximation by improving the number of hidden neurons.</p><p>What's more, there was nothing special about our original goalfunction, $f(x) = 0.2+0.4 x^2+0.3 \sin(15 x) + 0.05 \cos(50 x)$.  Wecould have used this procedure for any continuous function from $[0,1]$ to $[0, 1]$.  In essence, we're using our single-layer neuralnetworks to build a lookup table for the function.  And we'll be ableto build on this idea to provide a general proof of universality.</p><p><h3><a name="many_input_variables"></a><a href="chap4.html#many_input_variables">Many input variables</a></h3></p><p>Let's extend our results to the case of many input variables.  Thissounds complicated, but all the ideas we need can be understood in thecase of just two inputs.  So let's address the two-input case.</p><p>We'll start by considering what happens when we have two inputs to aneuron:</p><p><center> <canvas id="two_inputs" width="350" height="220"></canvas></center></p><p>Here, we have inputs $x$ and $y$, with corresponding weights $w_1$ and$w_2$, and a bias $b$ on the neuron.  Let's set the weight $w_2$ to$0$, and then play around with the first weight, $w_1$, and the bias,$b$, to see how they affect the output from the neuron:</p><p><script src="js/three.min.js"></script></p><p><canvas id="ti_graph" width="200" height="220"></canvas><span id="ti_graph_3d" style="position: absolute; left: 260px;"></span></p><p>As you can see, with $w_2 = 0$ the input $y$ makes no difference tothe output from the neuron.  It's as though $x$ is the only input.</p><p>Given this, what do you think happens when we increase the weight$w_1$ to $w_1 = 100$, with $w_2$ remaining $0$?  If you don'timmediately see the answer, ponder the question for a bit, and see ifyou can figure out what happens.  Then try it out and see if you'reright.  I've shown what happens in the following movie:</p><p><div>	<div id="b" class="videoOverlay" 	     style="width: 460px; height: 252px; opacity: 0.8" 	     onclick="playVideo('b');">	  <img style="left: 160px; top: 70px;" 	       src="images/play.png" width="128px">	</div>	  <video id="vb" width="460" height="252" preload 		 onended="videoEnded('b');">	    <source type="video/mp4"		    src="movies/step_3d.mp4">	    <source type="video/webm"		    src="movies/step_3d.webm"></p><p>	  </video></div></p><p>Just as in our earlier discussion, as the input weight gets larger theoutput approaches a step function.  The difference is that now thestep function is in three dimensions.  Also as before, we can move thelocation of the step point around by modifying the bias.  The actuallocation of the step point is $s_x \equiv -b / w_1$.</p><p>Let's redo the above using the position of the step as the parameter:</p><p><canvas id="ti_graph_redux" width="200" height="220"></canvas> <spanid="ti_graph_redux_3d" style="position: absolute; left:260px;"></span></p><p>Here, we assume the weight on the $x$ input has some large value- I've used $w_1 = 1000$ - and the weight $w_2 = 0$.  Thenumber on the neuron is the step point, and the little $x$ above thenumber reminds us that the step is in the $x$ direction.	Of course, it's also possible to get a step function in the $y$direction, by making the weight on the $y$ input very large (say, $w_2= 1000$), and the weight on the $x$ equal to $0$, i.e., $w_1 = 0$:</p><p><canvas id="y_step" width="200" height="220"></canvas> <spanid="y_step_3d" style="position: absolute; left: 260px;"></span></p><p>The number on the neuron is again the step point, and in this case thelittle $y$ above the number reminds us that the step is in the $y$direction.  I could have explicitly marked the weights on the $x$ and$y$ inputs, but decided not to, since it would make the diagram rathercluttered.  But do keep in mind that the little $y$ marker implicitlytells us that the $y$ weight is large, and the $x$ weight is $0$.</p><p>We can use the step functions we've just constructed to compute athree-dimensional bump function.  To do this, we use two neurons, eachcomputing a step function in the $x$ direction.  Then we combine thosestep functions with weight $h$ and $-h$, respectively, where $h$ isthe desired height of the bump.  It's all illustrated in the followingdiagram:</p><p><canvas id="bump_3d" width="300" height="220"></canvas> <spanid="bump_3d_graph" style="position: absolute; left: 360px;"></span></p><p>Try changing the value of the height, $h$. Observe how it relates tothe weights in the network.  And see how it changes the height of thebump function on the right.</p><p>Also, try changing the step point $0.30$ associated to the top hiddenneuron.  Witness how it changes the shape of the bump.  What happenswhen you move it past the step point $0.70$ associated to the bottomhidden neuron?</p><p>We've figured out how to make a bump function in the $x$ direction.Of course, we can easily make a bump function in the $y$ direction, byusing two step functions in the $y$ direction.  Recall that we do thisby making the weight large on the $y$ input, and the weight $0$ on the$x$ input.  Here's the result:</p><p><canvas id="bump_3d_y" width="300" height="220"></canvas> <spanid="bump_3d_y_graph" style="position: absolute; left: 360px;"></span></p><p>This looks nearly identical to the earlier network!  The only thingexplicitly shown as changing is that there's now little $y$ markers onour hidden neurons.  That reminds us that they're producing $y$ stepfunctions, not $x$ step functions, and so the weight is very large onthe $y$ input, and zero on the $x$ input, not vice versa.  As before,I decided not to show this explicitly, in order to avoid clutter.</p><p>Let's consider what happens when we add up two bump functions, one inthe $x$ direction, the other in the $y$ direction, both of height $h$:</p><p><canvas id="xy_bump" width="300" height="270"></canvas> <spanid="xy_bump_3d" style="position: absolute; left: 360px;"></span></p><p>To simplify the diagram I've dropped the connections with zero weight.For now, I've left in the little $x$ and $y$ markers on the hiddenneurons, to remind you in what directions the bump functions are beingcomputed.  We'll drop even those markers later, since they're impliedby the input variable.</p><p>Try varying the parameter $h$.  As you can see, this causes the outputweights to change, and also the heights of both the $x$ and $y$ bumpfunctions.</p><p>What we've built looks a little like a <em>tower</em> function:</p><p><center> <span id="tower" style="position: absolute;"></span> <divstyle="height: 230px"></div> </center></p><p>If we could build such tower functions, then we could use them toapproximate arbitrary functions, just by adding up many towers ofdifferent heights, and in different locations:</p><p><center> <span id="many_towers" style="position: absolute;"></span><div style="height: 230px"></div> </center></p><p>Of course, we haven't yet figured out how to build a tower function.What we have constructed looks like a central tower, of height $2h$,with a surrounding plateau, of height $h$.</p><p>But we can make a tower function.  Remember that earlier we sawneurons can be used to implement a type of <tt>if-then-else</tt>statement:</p><p><div class="highlight"><pre><span></span>    <span class="k">if</span> input &gt;<span class="o">=</span> threshold: 
        output 1
    <span class="k">else</span>:
        output 0
</pre></div>
</p><p>That was for a neuron with just a single input.  What we want is toapply a similar idea to the combined output from the hidden neurons:</p><p><div class="highlight"><pre><span></span>    <span class="k">if</span> combined output from hidden neurons &gt;<span class="o">=</span> threshold:
        output 1
    <span class="k">else</span>:
        output 0
</pre></div>
</p><p>If we choose the <tt>threshold</tt> appropriately - say, a value of$3h/2$, which is sandwiched between the height of the plateau and theheight of the central tower - we could squash the plateau down tozero, and leave just the tower standing.</p><p>Can you see how to do this?  Try experimenting with the followingnetwork to figure it out.  Note that we're now plotting the outputfrom the entire network, not just the weighted output from the hiddenlayer.  This means we add a bias term to the weighted output from thehidden layer, and apply the sigma function.  Can you find values for$h$ and $b$ which produce a tower?  This is a bit tricky, so if youthink about this for a while and remain stuck, here's two hints: (1)To get the output neuron to show the right kind of <tt>if-then-else</tt>behaviour, we need the input weights (all $h$ or $-h$) to be large;and (2) the value of $b$ determines the scale of the<tt>if-then-else</tt> threshold.</p><p><canvas id="tower_construction" width="300" height="270"></canvas><span id="tower_construction_3d" style="position: absolute; left:350px;"></span></p><p>With our initial parameters, the output looks like a flattened versionof the earlier diagram, with its tower and plateau.  To get thedesired behaviour, we increase the parameter $h$ until it becomeslarge.  That gives the <tt>if-then-else</tt> thresholdingbehaviour.  Second, to get the threshold right, we'll choose $b\approx -3h/2$.  Try it, and see how it works!</p><p>Here's what it looks like, when we use $h = 10$:</p><p><div>	<div id="c" class="videoOverlay" 	     style="width: 556px; height: 284px; opacity: 0.8" 	     onclick="playVideo('c');">	  <img style="left: 210px; top: 80px;" 	       src="images/play.png" width="128px">	</div>	  <video id="vc" width="556" height="284" preload 		 onended="videoEnded('c');">	    <source type="video/mp4"		    src="movies/tower_construction.mp4">	    <source type="video/webm"		    src="movies/tower_construction.webm"></video></div></p><p>Even for this relatively modest value of $h$, we get a pretty goodtower function.  And, of course, we can make it as good as we want byincreasing $h$ still further, and keeping the bias as $b = -3h/2$.</p><p>Let's try gluing two such networks together, in order to compute twodifferent tower functions.  To make the respective roles of the twosub-networks clear I've put them in separate boxes, below: each boxcomputes a tower function, using the technique described above.  Thegraph on the right shows the weighted output from the <em>second</em>hidden layer, that is, it's a weighted combination of tower functions.</p><p><canvas id="the_two_towers" width="320" height="580"></canvas> <spanid="the_two_towers_3d" style="position: absolute; left: 370px;margin-top: 180px;"></span></p><p>In particular, you can see that by modifying the weights in the finallayer you can change the height of the output towers.</p><p>The same idea can be used to compute as many towers as we like.  Wecan also make them as thin as we like, and whatever height we like.As a result, we can ensure that the weighted output from the secondhidden layer approximates any desired function of two variables:</p><p><center> <span id="many_towers_2" style="position: absolute;"></span><div style="height: 230px"></div> </center></p><p>In particular, by making the weighted output from the second hiddenlayer a good approximation to $\sigma^{-1} \circ f$, we ensure theoutput from our network will be a good approximation to any desiredfunction, $f$.</p><p>What about functions of more than two variables?</p><p>Let's try three variables $x_1, x_2, x_3$.  The following network canbe used to compute a tower function in four dimensions:</p><p><canvas id="tower_n_dim" width="300" height="410"></canvas></p><p>Here, the $x_1, x_2, x_3$ denote inputs to the network.  The $s_1,t_1$ and so on are step points for neurons - that is, all theweights in the first layer are large, and the biases are set to givethe step points $s_1, t_1, s_2, \ldots$.  The weights in the secondlayer alternate $+h, -h$, where $h$ is some very large number.  Andthe output bias is $-5h/2$.</p><p>This network computes a function which is $1$ provided threeconditions are met: $x_1$ is between $s_1$ and $t_1$; $x_2$ is between$s_2$ and $t_2$; and $x_3$ is between $s_3$ and $t_3$.  The network is$0$ everywhere else.  That is, it's a kind of tower which is $1$ in alittle region of input space, and $0$ everywhere else.</p><p>By gluing together many such networks we can get as many towers as wewant, and so approximate an arbitrary function of three variables.Exactly the same idea works in $m$ dimensions.  The only change neededis to make the output bias $(-m+1/2)h$, in order to get the right kindof sandwiching behavior to level the plateau.</p><p>Okay, so we now know how to use neural networks to approximate areal-valued function of many variables.  What about vector-valuedfunctions $f(x_1, \ldots, x_m) \in R^n$?  Of course, such a functioncan be regarded as just $n$ separate real-valued functions, $f^1(x_1,\ldots, x_m), f^2(x_1, \ldots, x_m)$, and so on.  So we create anetwork approximating $f^1$, another network for $f^2$, and so on.And then we simply glue all the networks together.  So that's alsoeasy to cope with.</p><p><h4><a name="problem_863961"></a><a href="chap4.html#problem_863961">Problem</a></h4><ul></p><p><li> We've seen how to use networks with two hidden layers to  approximate an arbitrary function.  Can you find a proof showing  that it's possible with just a single hidden layer?  As a hint, try  working in the case of just two input variables, and showing that:  (a) it's possible to get step functions not just in the $x$ or $y$  directions, but in an arbitrary direction; (b) by adding up many of  the constructions from part (a) it's possible to approximate a tower  function which is circular in shape, rather than rectangular; (c)  using these circular towers, it's possible to approximate an  arbitrary function.  To do part (c) it may help to use ideas from a  <a href="chap4.html#fixing_up_the_step_functions">bit later in this    chapter</a>.</p><p></ul></p><p><h3><a name="extension_beyond_sigmoid_neurons"></a><a href="chap4.html#extension_beyond_sigmoid_neurons">Extension beyond sigmoid neurons</a></h3></p><p>We've proved that networks made up of sigmoid neurons can compute anyfunction.  Recall that in a sigmoid neuron the inputs $x_1, x_2,\ldots$ result in the output $\sigma(\sum_j w_j x_j + b)$, where $w_j$are the weights, $b$ is the bias, and $\sigma$ is the sigmoidfunction:</p><p><center> <canvas id="sigmoid" width="500" height="200"></canvas></center></p><p>What if we consider a different type of neuron, one using some otheractivation function, $s(z)$:</p><p><center> <canvas id="sigmoid_like" width="500" height="200"></canvas></center></p><p>That is, we'll assume that if our neurons has inputs $x_1, x_2,\ldots$, weights $w_1, w_2, \ldots$ and bias $b$, then the output is$s(\sum_j w_j x_j + b)$.</p><p>We can use this activation function to get a step function, just as wedid with the sigmoid.  Try ramping up the weight in the following, sayto $w = 100$:</p><p><canvas id="ramping" width="600" height="200"></canvas></p><p>Just as with the sigmoid, this causes the activation function tocontract, and ultimately it becomes a very good approximation to astep function.  Try changing the bias, and you'll see that we can setthe position of the step to be wherever we choose.  And so we can useall the same tricks as before to compute any desired function.</p><p>What properties does $s(z)$ need to satisfy in order for this to work?We do need to assume that $s(z)$ is well-defined as $z \rightarrow-\infty$ and $z \rightarrow \infty$.  These two limits are the twovalues taken on by our step function.  We also need to assume thatthese limits are different from one another.  If they weren't, there'dbe no step, simply a flat graph!  But provided the activation function$s(z)$ satisfies these properties, neurons based on such an activationfunction are universal for computation.</p><p><h4><a name="problems_963556"></a><a href="chap4.html#problems_963556">Problems</a></h4><ul><li> Earlier in the book we met another type of neuron known as a <a  href="chap3.html#other_models_of_artificial_neuron">rectified linear  unit</a>.  Explain why such neurons don't satisfy the conditions  just given for universality.  Find a proof of universality showing  that rectified linear units are universal for computation.</p><p><li> Suppose we consider linear neurons, i.e., neurons with the  activation function $s(z) = z$.  Explain why linear neurons don't  satisfy the conditions just given for universality.  Show that such  neurons can't be used to do universal computation.</ul></p><p><h3><a name="fixing_up_the_step_functions"></a><a href="chap4.html#fixing_up_the_step_functions">Fixing up the step functions</a></h3></p><p>Up to now, we've been assuming that our neurons can produce stepfunctions exactly.  That's a pretty good approximation, but it is onlyan approximation.  In fact, there will be a narrow window of failure,illustrated in the following graph, in which the function behaves verydifferently from a step function:</p><p><canvas id="failure" width="220" height="200"></canvas></p><p>In these windows of failure the explanation I've given foruniversality will fail.</p><p>Now, it's not a terrible failure.  By making the weights input to theneurons big enough we can make these windows of failure as small as welike.  Certainly, we can make the window much narrower than I've shownabove - narrower, indeed, than our eye could see.  So perhaps wemight not worry too much about this problem.</p><p>Nonetheless, it'd be nice to have some way of addressing the problem.</p><p>In fact, the problem turns out to be easy to fix.  Let's look at thefix for neural networks computing functions with just one input andone output.  The same ideas work also to address the problem whenthere are more inputs and outputs.</p><p>In particular, suppose we want our network to compute some function,$f$.  As before, we do this by trying to design our network so thatthe weighted output from our hidden layer of neurons is $\sigma^{-1}\circ f(x)$:</p><p><center> <canvas id="inverted_function_2" width="340"height="300"></canvas> </center></p><p>If we were to do this using the technique described earlier, we'd usethe hidden neurons to produce a sequence of bump functions:</p><p><center> <canvas id="series_of_bumps" width="340"height="300"></canvas> </center></p><p>Again, I've exaggerated the size of the windows of failure, in orderto make them easier to see.  It should be pretty clear that if we addall these bump functions up we'll end up with a reasonableapproximation to $\sigma^{-1} \circ f(x)$, except within the windowsof failure.</p><p>Suppose that instead of using the approximation just described, we usea set of hidden neurons to compute an approximation to <em>half</em> ouroriginal goal function, i.e., to $\sigma^{-1} \circ f(x) / 2$.  Ofcourse, this looks just like a scaled down version of the last graph:</p><p><center> <canvas id="half_bumps" width="340" height="300"></canvas></center></p><p>And suppose we use another set of hidden neurons to compute anapproximation to $\sigma^{-1} \circ f(x)/ 2$, but with the bases ofthe bumps <em>shifted</em> by half the width of a bump:</p><p><center> <canvas id="shifted_bumps" width="340" height="300"></canvas></center></p><p>Now we have two different approximations to $\sigma^{-1} \circ f(x) /2$.  If we add up the two approximations we'll get an overallapproximation to $\sigma^{-1} \circ f(x)$.  That overall approximationwill still have failures in small windows.  But the problem will bemuch less than before.  The reason is that points in a failure windowfor one approximation won't be in a failure window for the other.  Andso the approximation will be a factor roughly $2$ better in thosewindows.</p><p>We could do even better by adding up a large number, $M$, ofoverlapping approximations to the function $\sigma^{-1} \circ f(x) /M$. Provided the windows of failure are narrow enough, a point willonly ever be in one window of failure.  And provided we're using alarge enough number $M$ of overlapping approximations, the result willbe an excellent overall approximation.</p><p><h3><a name="conclusion"></a><a href="chap4.html#conclusion">Conclusion</a></h3></p><p>The explanation for universality we've discussed is certainly not apractical prescription for how to compute using neural networks!  Inthis, it's much like proofs of universality for <tt>NAND</tt> gates andthe like.  For this reason, I've focused mostly on trying to make theconstruction clear and easy to follow, and not on optimizing thedetails of the construction.  However, you may find it a fun andinstructive exercise to see if you can improve the construction.</p><p>Although the result isn't directly useful in constructing networks,it's important because it takes off the table the question of whetherany particular function is computable using a neural network.  Theanswer to that question is always "yes".  So the right question toask is not whether any particular function is computable, but ratherwhat's a <em>good</em> way to compute the function.</p><p>The universality construction we've developed uses just two hiddenlayers to compute an arbitrary function.  Furthermore, as we'vediscussed, it's possible to get the same result with just a singlehidden layer.  Given this, you might wonder why we would ever beinterested in deep networks, i.e., networks with many hidden layers.Can't we simply replace those networks with shallow, single hiddenlayer networks?</p><p><span class="marginnote"><strong>Chapter  acknowledgments:</strong> Thanks to <a  href="http://jendodd.com">Jen Dodd</a> and <a  href="http://colah.github.io/about.html">Chris Olah</a> for many  discussions about universality in neural networks.  My thanks, in  particular, to Chris for suggesting the use of a lookup table to  prove universality.  The interactive visual form of the chapter is  inspired by the work of people such as <a  href="http://bost.ocks.org/mike/algorithms/">Mike Bostock</a>, <a  href="http://www-cs-students.stanford.edu/&#126;amitp/">Amit  Patel</a>, <a href="http://worrydream.com">Bret Victor</a>, and <a  href="http://acko.net/">Steven Wittens</a>. </span></p><p>While in principle that's possible, there are good practical reasonsto use deep networks.  As argued in<a href="chap1.html#toward_deep_learning">Chapter 1</a>, deep networkshave a hierarchical structure which makes them particularly welladapted to learn the hierarchies of knowledge that seem to be usefulin solving real-world problems.  Put more concretely, when attackingproblems such as image recognition, it helps to use a system thatunderstands not just individual pixels, but also increasingly morecomplex concepts: from edges to simple geometric shapes, all the wayup through complex, multi-object scenes. In later chapters, we'll seeevidence suggesting that deep networks do a better job than shallownetworks at learning such hierarchies of knowledge.  To sum up:universality tells us that neural networks can compute any function;and empirical evidence suggests that deep networks are the networksbest adapted to learn the functions useful in solving many real-worldproblems.</p><p><!-- Seems to be necessary to ensure the font loads --> <span style="font-family: MJX_Math; color: #fff;">.</span> <span style="font-family: MJX_Main; color: #fff;">.</span></p><p><script src="js/chap4.js"></script><script src="js/canvas.js"></script><script src="js/neuron.js"></script><script src="js/scrubbable.js"></script><script src="js/button.js"></script></p><p></div><div class="footer"> <span class="left_footer"> In academic work,
please cite this book as: Michael A. Nielsen, "Neural Networks and
Deep Learning", Determination Press, 2015

<br/>
<br/>

This work is licensed under a <a rel="license"
href="http://creativecommons.org/licenses/by-nc/3.0/deed.en_GB"
style="color: #eee;">Creative Commons Attribution-NonCommercial 3.0
Unported License</a>.  This means you're free to copy, share, and
build on this book, but not to sell it.  If you're interested in
commercial use, please <a
href="mailto:mn@michaelnielsen.org">contact me</a>.
</span>
<span class="right_footer">
Last update: Thu Dec 26 15:26:33 2019
<br/>
<br/>
<br/>
<a rel="license" href="http://creativecommons.org/licenses/by-nc/3.0/deed.en_GB"><img alt="Creative Commons Licence" style="border-width:0" src="http://i.creativecommons.org/l/by-nc/3.0/88x31.png" /></a>
</span>
</div>
<script>
  (function(i,s,o,g,r,a,m){i['GoogleAnalyticsObject']=r;i[r]=i[r]||function(){
  (i[r].q=i[r].q||[]).push(arguments)},i[r].l=1*new Date();a=s.createElement(o),
  m=s.getElementsByTagName(o)[0];a.async=1;a.src=g;m.parentNode.insertBefore(a,m)
  })(window,document,'script','//www.google-analytics.com/analytics.js','ga');

  ga('create', 'UA-44208967-1', 'neuralnetworksanddeeplearning.com');
  ga('send', 'pageview');

</script>
</body>
</html>